T-Crowd: Effective Crowdsourcing for Tabular Data
Crowdsourcing employs human workers to solve computer-hard problems, such as
data cleaning, entity resolution, and sentiment analysis. When crowdsourcing
tabular data, e.g., the attribute values of an entity set, a worker's answers
on the different attributes (e.g., the nationality and age of a celebrity star)
are often treated independently. This assumption is not always true and can
lead to suboptimal crowdsourcing performance. In this paper, we present the
T-Crowd system, which takes into consideration the intricate relationships
among tasks, in order to converge faster to their true values. In particular,
T-Crowd integrates each worker's answers on different attributes to effectively
learn his/her trustworthiness and the true data values. The attribute
relationship information is also used to guide task allocation to workers.
Finally, T-Crowd seamlessly supports categorical and continuous attributes,
which are the two main datatypes found in typical databases. Our extensive
experiments on real and synthetic datasets show that T-Crowd outperforms
state-of-the-art methods in terms of truth inference and reducing the cost of
crowdsourcing.
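The joint-inference idea can be sketched as a toy weighted-voting loop in which a single reliability score per worker is pooled across every attribute that worker answered. This is an illustrative simplification, not T-Crowd's actual model: the data, the smoothing rule, and the restriction to categorical attributes are all assumptions made for brevity.

```python
# Toy sketch: one shared reliability per worker, pooled across attributes.
# NOT the T-Crowd algorithm; a simplified weighted-voting illustration.
from collections import defaultdict

def infer_truths(answers, n_rounds=5):
    """answers: list of (worker, item, value) triples for categorical items."""
    reliability = defaultdict(lambda: 1.0)
    truths = {}
    for _ in range(n_rounds):
        # Weighted vote per item using current worker reliabilities.
        votes = defaultdict(lambda: defaultdict(float))
        for w, item, v in answers:
            votes[item][v] += reliability[w]
        truths = {item: max(vs, key=vs.get) for item, vs in votes.items()}
        # Re-estimate reliability as the (smoothed) fraction of a worker's
        # answers matching the current truths, pooled across ALL attributes.
        hits, total = defaultdict(float), defaultdict(int)
        for w, item, v in answers:
            total[w] += 1
            hits[w] += (v == truths[item])
        for w in total:
            reliability[w] = (hits[w] + 1) / (total[w] + 2)
    return truths, dict(reliability)

# Hypothetical answers about two attributes of one celebrity.
answers = [
    ("w1", "celebrity_nationality", "US"),
    ("w2", "celebrity_nationality", "US"),
    ("w3", "celebrity_nationality", "UK"),
    ("w1", "celebrity_age_bucket", "40s"),
    ("w3", "celebrity_age_bucket", "30s"),
]
truths, rel = infer_truths(answers)
```

Because w3 disagrees with the majority on nationality, their pooled reliability drops, which also down-weights their age answer.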
Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads
LSM-trees are widely adopted as the storage backend of key-value stores.
However, optimizing the system performance under dynamic workloads has not been
sufficiently studied or evaluated in previous work. To fill the gap, we present
RusKey, a key-value store with the following new features: (1) RusKey is the
first attempt to orchestrate LSM-tree structures online, enabling robust
performance under dynamic workloads; (2) RusKey is the first
study to use Reinforcement Learning (RL) to guide LSM-tree transformations; (3)
RusKey includes a new LSM-tree design, named FLSM-tree, for an efficient
transition between different compaction policies -- the bottleneck of dynamic
key-value stores. We justify the superiority of the new design with theoretical
analysis; (4) RusKey requires no prior workload knowledge for system
adjustment, in contrast to state-of-the-art techniques. Experiments show that
RusKey exhibits strong performance robustness in diverse workloads, achieving
up to 4x better end-to-end performance than the RocksDB system under various
settings.
Comment: 25 pages, 13 figures
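The online policy-switching loop can be pictured as a toy epsilon-greedy bandit that learns which compaction policy earns the higher reward under the current workload. The policy names, the reward model, and the bandit formulation below are illustrative assumptions, not RusKey's actual RL design.

```python
# Toy epsilon-greedy bandit choosing a compaction policy online.
# Illustrative only; RusKey's actual RL formulation differs.
import random

POLICIES = ["leveling", "tiering"]

def choose_policy(q, eps, rng):
    """Explore with probability eps, otherwise exploit the best estimate."""
    if rng.random() < eps:
        return rng.choice(POLICIES)
    return max(POLICIES, key=lambda p: q[p])

def run(workload_rewards, steps=200, eps=0.1, seed=0):
    """workload_rewards: policy -> mean reward (e.g., observed throughput)."""
    rng = random.Random(seed)
    q = {p: 0.0 for p in POLICIES}   # running reward estimates
    n = {p: 0 for p in POLICIES}
    for _ in range(steps):
        p = choose_policy(q, eps, rng)
        r = workload_rewards[p] + rng.gauss(0, 0.1)  # noisy observation
        n[p] += 1
        q[p] += (r - q[p]) / n[p]    # incremental mean update
    return q

# Hypothetical write-heavy phase in which tiering yields higher reward.
q = run({"leveling": 0.3, "tiering": 0.8})
```

After a modest number of steps the agent's estimates separate and it exploits the better policy, which is the behavior an online LSM-tree tuner needs.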
Biological Factor Regulatory Neural Network
Genes are fundamental for analyzing biological systems, and many recent works
have proposed deep learning models that utilize gene expression for various
biological tasks. Despite their promising performance, it is hard for deep
neural networks to provide biological insights for humans due to their
black-box nature. Recently, some works integrated biological knowledge with
neural networks to improve the transparency and performance of their models.
However, these methods can only incorporate partial biological knowledge,
leading to suboptimal performance. In this paper, we propose the Biological
Factor Regulatory Neural Network (BFReg-NN), a generic framework to model
relations among biological factors in cell systems. BFReg-NN starts from gene
expression data and is capable of merging most existing biological knowledge
into the model, including the regulatory relations among genes or proteins
(e.g., gene regulatory networks (GRN), protein-protein interaction networks
(PPI)) and the hierarchical relations among genes, proteins and pathways (e.g.,
several genes/proteins are contained in a pathway). Moreover, BFReg-NN also has
the ability to provide new biologically meaningful insights because of its
white-box characteristics. Experimental results on different gene
expression-based tasks verify the superiority of BFReg-NN compared with
baselines. Our case studies also show that the key insights found by BFReg-NN
are consistent with the biological literature.
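The white-box construction can be sketched by masking a layer's weight matrix with a prior knowledge graph, so that a connection exists only where a known regulatory edge exists. The tiny mask, the shapes, and the single-layer setup below are hypothetical stand-ins for BFReg-NN's full architecture.

```python
# Sketch: a layer whose connections are restricted by a (hypothetical)
# gene-to-protein regulatory mask, mimicking the white-box idea.
import numpy as np

def knowledge_masked_layer(x, weights, mask):
    """Zero out weights with no supporting biological edge, then apply ReLU."""
    return np.maximum(0.0, x @ (weights * mask))

rng = np.random.default_rng(0)
n_genes, n_proteins = 4, 3
# mask[i, j] = 1 iff gene i is assumed to regulate protein j (made up).
mask = np.array([[1, 0, 0],
                 [1, 1, 0],
                 [0, 1, 1],
                 [0, 0, 1]], dtype=float)
weights = rng.normal(size=(n_genes, n_proteins))
expression = rng.random((2, n_genes))        # 2 samples of gene expression
protein_activity = knowledge_masked_layer(expression, weights, mask)
```

Because every surviving weight corresponds to a known edge, a large learned weight can be read back as evidence about that specific regulatory relation, which is what makes the model inspectable.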
Label Propagation for Graph Label Noise
Label noise is a common challenge in large datasets, as it can significantly
degrade the generalization ability of deep neural networks. Most existing
studies focus on noisy labels in computer vision; however, graph models
encompass both node features and graph topology as input, and become more
susceptible to label noise through message-passing mechanisms. To date, only a
few works have been proposed to tackle label noise on graphs. One major
limitation is that they assume the graph is homophilous and the labels are
smoothly distributed. Nevertheless, real-world graphs may contain varying
degrees of heterophily or even be heterophily-dominated, leading to the
inadequacy of current methods. In this paper, we study graph label noise in the
context of arbitrary heterophily, with the aim of rectifying noisy labels and
assigning labels to previously unlabeled nodes. We begin by conducting two
empirical analyses to explore the impact of graph homophily on graph label
noise. Following these observations, we propose a simple yet efficient algorithm,
denoted as LP4GLN. Specifically, LP4GLN is an iterative algorithm with three
steps: (1) reconstruct the graph to recover the homophily property, (2) utilize
label propagation to rectify the noisy labels, (3) select high-confidence
labels to retain for the next iteration. By iterating these steps, we obtain a
set of correct labels, ultimately achieving high accuracy in the node
classification task. Theoretical analysis is also provided to demonstrate
its remarkable denoising effect. Finally, we conduct experiments on 10
benchmark datasets under varying graph heterophily levels and noise types,
comparing the performance of LP4GLN with 7 typical baselines. Our results
illustrate the superior performance of the proposed LP4GLN.
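The three-step loop can be sketched as follows, with the graph-reconstruction step replaced by a fixed homophilous adjacency for brevity. The propagation rule is standard label propagation; the example graph, labels, and confidence threshold are made up for illustration.

```python
# Sketch of one LP4GLN-style iteration: propagate labels, then keep only
# high-confidence predictions. Graph reconstruction is omitted here.
import numpy as np

def propagate(adj, labels_onehot, alpha=0.8, n_iter=20):
    """Standard label propagation: F <- alpha * S @ F + (1 - alpha) * Y."""
    deg = adj.sum(axis=1)
    s = adj / np.sqrt(np.outer(deg, deg))    # symmetric normalization
    f = labels_onehot.copy()
    for _ in range(n_iter):
        f = alpha * (s @ f) + (1 - alpha) * labels_onehot
    return f

def select_confident(scores, threshold=0.3):
    """Return (node, label) pairs whose top score clears the threshold."""
    top = scores.max(axis=1)
    return [(i, int(scores[i].argmax()))
            for i in range(len(scores)) if top[i] > threshold]

# Two disconnected homophilous triangles: {0,1,2} and {3,4,5}.
adj = np.array([[0, 1, 1, 0, 0, 0],
                [1, 0, 1, 0, 0, 0],
                [1, 1, 0, 0, 0, 0],
                [0, 0, 0, 0, 1, 1],
                [0, 0, 0, 1, 0, 1],
                [0, 0, 0, 1, 1, 0]], dtype=float)
y = np.zeros((6, 2))
y[0, 0] = y[1, 0] = 1     # triangle A labeled class 0
y[2, 1] = 1               # noisy: node 2 wrongly labeled class 1
y[4, 1] = y[5, 1] = 1     # triangle B labeled class 1; node 3 unlabeled
scores = propagate(adj, y)
confident = dict(select_confident(scores))
```

In this toy run the clean neighbors of node 2 outvote its noisy label (rectification), while the unlabeled node 3 inherits its cluster's label, mirroring the two goals stated in the abstract.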
Truth Inference in Crowdsourcing: Is the Problem Solved?
Crowdsourcing has emerged as a novel problem-solving paradigm, which facilitates addressing problems that are hard for computers, e.g., entity resolution and sentiment analysis. However, due to the openness of crowdsourcing, workers may yield low-quality answers, so a redundancy-based method is widely employed: each task is first assigned to multiple workers, and the correct answer (called the truth) for the task is then inferred from the answers of the assigned workers. A fundamental problem in this method is Truth Inference, which decides how to effectively infer the truth. Recently, the database and data mining communities have independently studied this problem and proposed various algorithms. However, these algorithms have not been compared extensively under the same framework, and it is hard for practitioners to select appropriate algorithms. To alleviate this problem, we provide a detailed survey of 17 existing algorithms and perform a comprehensive evaluation using 5 real datasets. We make all code and datasets public for future research. Through experiments we find that existing algorithms are not stable across different datasets and that no algorithm outperforms the others consistently. We believe that the truth inference problem is not fully solved; we identify the limitations of existing algorithms and point out promising research directions.
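The simplest member of the algorithm family surveyed here is majority voting, which ignores worker quality entirely and takes the most frequent answer per task. A minimal sketch, with hypothetical tasks and answers:

```python
# Majority voting: the standard truth-inference baseline. Each task's
# truth is simply the answer given by the most workers.
from collections import Counter, defaultdict

def majority_vote(answers):
    """answers: list of (worker, task, label) triples -> {task: truth}."""
    by_task = defaultdict(list)
    for _, task, label in answers:
        by_task[task].append(label)
    return {task: Counter(labels).most_common(1)[0][0]
            for task, labels in by_task.items()}

# Hypothetical sentiment and entity-resolution tasks.
answers = [
    ("w1", "t1", "positive"), ("w2", "t1", "positive"), ("w3", "t1", "negative"),
    ("w1", "t2", "same"), ("w2", "t2", "different"), ("w3", "t2", "same"),
]
truths = majority_vote(answers)
```

The more sophisticated algorithms the survey evaluates differ mainly in how they replace this uniform vote with learned worker-quality weights.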
Learning Decomposed Spatial Relations for Multi-Variate Time-Series Modeling
Modeling multi-variate time-series (MVTS) data is a long-standing research subject and has found wide applications. Recently, there has been a surge of interest in modeling spatial relations between variables as graphs, i.e., first learning one static graph for each dataset and then exploiting the graph structure via graph neural networks. However, as spatial relations may differ substantially across samples, building one static graph for all the samples inherently limits flexibility and severely degrades performance in practice. To address this issue, we propose a framework for fine-grained modeling and utilization of spatial correlation between variables. By analyzing the statistical properties of real-world datasets, a universal decomposition of spatial correlation graphs is first identified. Specifically, the hidden spatial relations can be decomposed into a prior part, which applies across all the samples, and a dynamic part, which varies between samples; building different graphs is necessary to model these relations. To better coordinate the learning of the two relational graphs, we propose a min-max learning paradigm that not only regulates the common part of different dynamic graphs but also guarantees spatial distinguishability among samples. The experimental results show that our proposed model outperforms the state-of-the-art baseline methods on both time-series forecasting and time-series point prediction tasks.
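The prior-plus-dynamic decomposition can be sketched as mixing one shared adjacency with a per-sample adjacency built from that sample's own features. The similarity rule and the fixed mixing weight below are illustrative stand-ins for the paper's learned min-max formulation.

```python
# Sketch: per-sample adjacency = shared prior graph + sample-specific
# dynamic graph. The similarity-based dynamic graph is illustrative only.
import numpy as np

def dynamic_graph(sample, temperature=1.0):
    """Row-normalized adjacency from pairwise feature similarity."""
    sims = sample @ sample.T / temperature
    e = np.exp(sims - sims.max(axis=1, keepdims=True))   # stable softmax
    return e / e.sum(axis=1, keepdims=True)

def combined_adjacency(prior, sample, lam=0.5):
    """Mix the shared prior graph with the sample-specific dynamic graph."""
    return lam * prior + (1 - lam) * dynamic_graph(sample)

rng = np.random.default_rng(0)
n_vars, seq_len = 4, 16
prior = np.full((n_vars, n_vars), 1.0 / n_vars)  # shared across all samples
sample_a = rng.normal(size=(n_vars, seq_len))
sample_b = rng.normal(size=(n_vars, seq_len))
adj_a = combined_adjacency(prior, sample_a)
adj_b = combined_adjacency(prior, sample_b)
```

Two different samples yield two different adjacencies while sharing the same prior component, which is the flexibility that a single static graph cannot provide.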
A predictive method to determine incomplete electronic medical records
© 2018 Association for Computing Machinery. This paper utilizes predictive models to detect missing electronic medical records (EMRs) at general practice offices. Prior research has addressed the missing values problem in EMRs used for secondary analysis; however, health care providers have overlooked the problem of missing records in the EMRs that store patients' medical visit information. Our study provides a technique to predict the number of EMR entries for each practice based on its past data records. If the number of EMR entries is less than predicted, the method warns of potentially missing records at the 95% confidence level. The study uses seven years of EMRs from 14 general practice offices to train the predictive model, which predicts EMR data entries and accordingly identifies missing EMRs for the following year. We compared the model's predictions against the actual visits reflected in de-identified billing data. The study found that the auto-correlation method improves the performance of identifying missing records by detecting the prediction period. In addition, artificial neural networks and support vector machines perform better than other predictive methods, depending on whether the analysis aims at detecting missing EMRs or at identifying complete EMRs with no missing records. The results suggest that clinicians and medical professionals should be mindful of potentially missing EMRs prior to any secondary analysis.
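The alerting rule described above can be sketched with a simple linear-trend predictor standing in for the paper's neural-network and SVM models. The monthly counts and the residual-based 95% bound are illustrative assumptions, not the study's actual data or models.

```python
# Sketch: predict next period's EMR entry count and flag a practice when
# the observed count falls below a 95% lower bound. Illustrative only.
import numpy as np

def fit_trend(counts):
    """Least-squares linear trend over past periods."""
    t = np.arange(len(counts))
    slope, intercept = np.polyfit(t, counts, 1)
    residual_sd = np.std(counts - (slope * t + intercept), ddof=2)
    return slope, intercept, residual_sd

def flag_missing(counts, observed_next):
    """Flag if the next observed count is below the 95% lower bound."""
    slope, intercept, sd = fit_trend(counts)
    predicted = slope * len(counts) + intercept
    lower = predicted - 1.96 * sd
    return observed_next < lower, predicted

history = [120, 118, 125, 123, 130, 128, 133]   # hypothetical monthly counts
suspicious, predicted = flag_missing(history, observed_next=90)
ok, _ = flag_missing(history, observed_next=132)
```

A sharp shortfall against the trend triggers the warning, while a count consistent with the trend does not, matching the alerting behavior the abstract describes.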